Search | VHL Regional Portal

1.

Enhancing missense variant pathogenicity prediction with protein language models using VariPred.

Lin, Weining; Wells, Jude; Wang, Zeyuan; Orengo, Christine; Martin, Andrew C R.

Sci Rep ; 14(1): 8136, 2024 04 07.

Article in English | MEDLINE | ID: mdl-38584172

ABSTRACT

Computational approaches for predicting the pathogenicity of genetic variants have advanced in recent years. These methods enable researchers to determine the possible clinical impact of rare and novel variants. Historically these prediction methods used hand-crafted features based on structural, evolutionary, or physicochemical properties of the variant. In this study we propose a novel framework that leverages the power of pre-trained protein language models to predict variant pathogenicity. We show that our approach VariPred (Variant impact Predictor) outperforms current state-of-the-art methods by using an end-to-end model that only requires the protein sequence as input. Using one of the best-performing protein language models (ESM-1b), we establish a robust classifier that requires no calculation of structural features or multiple sequence alignments. We compare the performance of VariPred with other representative models including 3Cnet, Polyphen-2, REVEL, MetaLR, FATHMM and ESM variant. VariPred performs as well as, or in most cases better than, these other predictors on six variant impact prediction benchmarks, despite requiring only sequence data and no pre-processing of the data.

Subject(s)

Missense Mutation, Proteins, Virulence, Proteins/genetics, Amino Acid Sequence, Computational Biology/methods

2.

CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds.

Waman, Vaishali P; Bordin, Nicola; Alcraft, Rachel; Vickerstaff, Robert; Rauer, Clemens; Chan, Qian; Sillitoe, Ian; Yamamori, Hazuki; Orengo, Christine.

J Mol Biol ; : 168551, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38548261

ABSTRACT

CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new NextFlow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.

3.

FunPredCATH: An ensemble method for predicting protein function using CATH.

Bonello, Joseph; Orengo, Christine.

Biochim Biophys Acta Proteins Proteom ; 1872(2): 140985, 2024 02 01.

Article in English | MEDLINE | ID: mdl-38122964

ABSTRACT

MOTIVATION: The number of unannotated proteins in UniProt grows at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY: We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families, which is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS: In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION: We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS: FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than their IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
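The Fmax metric mentioned in the evaluation can be illustrated with a minimal Python sketch. It assumes the usual protein-centric definition (precision averaged over proteins with at least one prediction at a threshold, recall averaged over all benchmark proteins, and the best harmonic mean taken over thresholds). It is not the authors' evaluation code: it ignores propagation of GO terms to ancestor terms, and the protein and GO identifiers are hypothetical placeholders.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Best F-score over score thresholds (protein-centric, simplified).

    predictions maps protein -> {GO term: confidence score in [0, 1]}.
    truth maps protein -> non-empty set of true GO terms.
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []      # only proteins with >= 1 prediction at this threshold
        recall_sum = 0.0     # accumulated over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {go for go, s in scored.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical toy data, for illustration only.
toy_predictions = {"P1": {"GO:1": 0.9, "GO:2": 0.4}, "P2": {"GO:3": 0.8}}
toy_truth = {"P1": {"GO:1"}, "P2": {"GO:3", "GO:4"}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

Sweeping the threshold rather than fixing it is what lets Fmax compare predictors whose confidence scores are calibrated differently.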

Subject(s)

Proteins, Protein Sequence Analysis, Protein Databases, Proteins/metabolism, Molecular Sequence Annotation, Protein Sequence Analysis/methods, Gene Ontology

4.

Large-scale clustering of AlphaFold2 3D models shines light on the structure and function of proteins.

Bordin, Nicola; Lau, Andy M; Orengo, Christine.

Mol Cell ; 83(22): 3950-3952, 2023 Nov 16.

Article in English | MEDLINE | ID: mdl-37977115

ABSTRACT

Two recent studies exploited ultra-fast structural aligners and deep-learning approaches to cluster the protein structure space in the AlphaFold Database. Barrio-Hernandez et al. [1] and Durairaj et al. [2] uncovered fascinating new protein functions and structural features previously unknown.

Subject(s)

Cluster Analysis, Factual Databases

5.

Broad functional profiling of fission yeast proteins using phenomics and machine learning.

Rodríguez-López, María; Bordin, Nicola; Lees, Jon; Scholes, Harry; Hassan, Shaimaa; Saintain, Quentin; Kamrad, Stephan; Orengo, Christine; Bähler, Jürg.

Elife ; 12, 2023 10 03.

Article in English | MEDLINE | ID: mdl-37787768

ABSTRACT

Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe for broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of 'priority unstudied' proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through 'guilt by association' with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.
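The 'guilt by association' idea behind a phenotype-correlation network can be sketched as follows. This is an illustrative Python sketch only: it assumes each mutant has a numeric fitness profile across the same ordered set of conditions, uses Pearson correlation with an arbitrary cutoff of 0.8, and the gene names are hypothetical. The abstract does not specify the correlation measure or threshold actually used.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_edges(profiles: Dict[str, List[float]],
                      cutoff: float = 0.8) -> List[Tuple[str, str, float]]:
    """Link every pair of genes whose phenotype profiles correlate at or above cutoff."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_a], profiles[gene_b])
        if r >= cutoff:
            edges.append((gene_a, gene_b, r))
    return edges


# Hypothetical fitness scores in four conditions, for illustration only.
toy_profiles = {
    "geneA": [1.0, 0.5, 0.2, 0.9],   # characterised gene
    "geneB": [0.9, 0.4, 0.1, 0.8],   # unstudied gene with a similar profile
    "geneC": [0.2, 0.9, 1.0, 0.1],   # gene with an opposite profile
}
toy_edges = correlation_edges(toy_profiles)
```

An edge between an unstudied gene and a characterised one is only a hypothesis about shared function, which is why the study pairs such links with independent predictions and experimental validation.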

Subject(s)

Schizosaccharomyces pombe Proteins, Schizosaccharomyces, Humans, Phenomics, Schizosaccharomyces pombe Proteins/genetics, Phenotype, Schizosaccharomyces/genetics, Machine Learning

6.

Protein diversification through post-translational modifications, alternative splicing, and gene duplication.

Goldtzvik, Yonathan; Sen, Neeladri; Lam, Su Datt; Orengo, Christine.

Curr Opin Struct Biol ; 81: 102640, 2023 08.

Article in English | MEDLINE | ID: mdl-37354790

ABSTRACT

Proteins provide the basis for cellular function. Having multiple versions of the same protein within a single organism provides a way of regulating its activity or developing novel functions. Post-translational modifications of proteins, by means of adding/removing chemical groups to amino acids, allow for a well-regulated and controlled way of generating functionally distinct protein species. Alternative splicing is another method with which organisms possibly generate new isoforms. Additionally, gene duplication events throughout evolution generate multiple paralogs of the same genes, resulting in multiple versions of the same protein within an organism. In this review, we discuss recent advancements in the study of these three methods of protein diversification and provide illustrative examples of how they affect protein structure and function.

Subject(s)

Alternative Splicing, Gene Duplication, Molecular Evolution, Protein Isoforms/genetics, Post-Translational Protein Processing

7.

The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors.

Varadi, Mihaly; Bordin, Nicola; Orengo, Christine; Velankar, Sameer.

Curr Opin Struct Biol ; 79: 102543, 2023 04.

Article in English | MEDLINE | ID: mdl-36807079

ABSTRACT

The function of proteins can often be inferred from their three-dimensional structures. Experimental structural biologists spent decades studying these structures, but the accelerated pace of protein sequencing continuously increases the gaps between sequences and structures. The early 2020s saw the advent of a new generation of deep learning-based protein structure prediction tools that offer the potential to predict structures based on any number of protein sequences. In this review, we give an overview of the impact of this new generation of structure prediction tools, with examples of the impacted field in the life sciences. We discuss the novel opportunities and new scientific and technical challenges these tools present to the broader scientific community. Finally, we highlight some potential directions for the future of computational protein structure prediction.

Subject(s)

Deep Learning, Computational Biology/methods, Proteins/chemistry, Amino Acid Sequence

8.

ModelCIF: An Extension of PDBx/mmCIF Data Representation for Computed Structure Models.

Vallat, Brinda; Tauriello, Gerardo; Bienert, Stefan; Haas, Juergen; Webb, Benjamin M; Zídek, Augustin; Zheng, Wei; Peisach, Ezra; Piehl, Dennis W; Anischanka, Ivan; Sillitoe, Ian; Tolchard, James; Varadi, Mihaly; Baker, David; Orengo, Christine; Zhang, Yang; Hoch, Jeffrey C; Kurisu, Genji; Patwardhan, Ardan; Velankar, Sameer; Burley, Stephen K; Sali, Andrej; Schwede, Torsten; Berman, Helen M; Westbrook, John D.

J Mol Biol ; 435(14): 168021, 2023 07 15.

Article in English | MEDLINE | ID: mdl-36828268

ABSTRACT

ModelCIF (github.com/ihmwg/ModelCIF) is a data information framework developed for and by computational structural biologists to enable delivery of Findable, Accessible, Interoperable, and Reusable (FAIR) data to users worldwide. ModelCIF describes the specific set of attributes and metadata associated with macromolecular structures modeled by solely computational methods and provides an extensible data representation for deposition, archiving, and public dissemination of predicted three-dimensional (3D) models of macromolecules. It is an extension of the Protein Data Bank Exchange / macromolecular Crystallographic Information Framework (PDBx/mmCIF), which is the global data standard for representing experimentally-determined 3D structures of macromolecules and associated metadata. The PDBx/mmCIF framework and its extensions (e.g., ModelCIF) are managed by the Worldwide Protein Data Bank partnership (wwPDB, wwpdb.org) in collaboration with relevant community stakeholders such as the wwPDB ModelCIF Working Group (wwpdb.org/task/modelcif). This semantically rich and extensible data framework for representing computed structure models (CSMs) accelerates the pace of scientific discovery. Herein, we describe the architecture, contents, and governance of ModelCIF, and tools and processes for maintaining and extending the data standard. Community tools and software libraries that support ModelCIF are also described.

Subject(s)

Protein Databases, Macromolecular Substances/chemistry, Protein Conformation, Software

9.

KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units.

Adeyelu, Tolulope; Bordin, Nicola; Waman, Vaishali P; Sadlej, Marta; Sillitoe, Ian; Moya-Garcia, Aurelio A; Orengo, Christine A.

Biomolecules ; 13(2), 2023 02 02.

Article in English | MEDLINE | ID: mdl-36830646

ABSTRACT

Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.

Subject(s)

Protein Kinases, Proteins, Humans, Protein Kinases/metabolism, Proteins/chemistry, Protein Databases, Amino Acid Sequence Homology

10.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms.

Bordin, Nicola; Sillitoe, Ian; Nallapareddy, Vamsi; Rauer, Clemens; Lam, Su Datt; Waman, Vaishali P; Sen, Neeladri; Heinzinger, Michael; Littmann, Maria; Kim, Stephanie; Velankar, Sameer; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Commun Biol ; 6(1): 160, 2023 02 08.

Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36%, and will provide valuable insights on structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Subject(s)

Furylfuramide, Proteins, Humans, Protein Databases, Proteins/chemistry

11.

CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models.

Nallapareddy, Vamsi; Bordin, Nicola; Sillitoe, Ian; Heinzinger, Michael; Littmann, Maria; Waman, Vaishali P; Sen, Neeladri; Rost, Burkhard; Orengo, Christine.

Bioinformatics ; 39(1), 2023 01 01.

Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available at https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed at https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms, Proteins, Humans, Amino Acid Sequence Homology, Proteins/chemistry, Protein Databases

12.

Mapping the Constrained Coding Regions in the Human Genome to Their Corresponding Proteins.

Hasenahuer, Marcia A; Sanchis-Juan, Alba; Laskowski, Roman A; Baker, James A; Stephenson, James D; Orengo, Christine A; Raymond, F Lucy; Thornton, Janet M.

J Mol Biol ; 435(2): 167892, 2023 01 30.

Article in English | MEDLINE | ID: mdl-36410474

ABSTRACT

Constrained Coding Regions (CCRs) in the human genome have been derived from DNA sequencing data of large cohorts of healthy control populations, available in the Genome Aggregation Database (gnomAD) [1]. They identify regions depleted of protein-changing variants and thus identify segments of the genome that have been constrained during human evolution. By mapping these DNA-defined regions from genomic coordinates onto the corresponding protein positions and combining this information with protein annotations, we have explored the distribution of CCRs and compared their co-occurrence with different protein functional features, previously annotated at the amino acid level in public databases. As expected, our results reveal that functional amino acids involved in interactions with DNA/RNA, protein-protein contacts and catalytic sites are the protein features most likely to be highly constrained for variation in the control population. More surprisingly, we also found that linear motifs, linear interacting peptides (LIPs), disorder-order transitions upon binding with other protein partners and liquid-liquid phase separating (LLPS) regions are also strongly associated with high constraint for variability. We also compared intra-species constraints in the human CCRs with inter-species conservation and functional residues to explore how such CCRs may contribute to the analysis of protein variants. As has been previously observed, CCRs are only weakly correlated with conservation, suggesting that intraspecies constraints complement interspecies conservation and can provide more information to interpret variant effects.
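The step of mapping a DNA-defined region from genomic coordinates onto protein positions can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the transcript is taken to lie on the forward strand, coding exons are given as 1-based inclusive intervals that start at the first base of the start codon, and the function name and coordinates are hypothetical.

```python
from typing import List, Optional, Tuple


def genomic_to_residue(pos: int,
                       cds_exons: List[Tuple[int, int]]) -> Optional[int]:
    """Return the 1-based residue number encoded at a genomic position.

    cds_exons holds (start, end) coding intervals, 1-based and inclusive,
    on the forward strand. Returns None for intronic or non-coding positions.
    """
    coding_offset = 0  # number of coding bases that precede the current exon
    for start, end in sorted(cds_exons):
        if start <= pos <= end:
            coding_offset += pos - start
            return coding_offset // 3 + 1  # three bases per codon
        coding_offset += end - start + 1
    return None


def region_to_residues(region: Tuple[int, int],
                       cds_exons: List[Tuple[int, int]]) -> List[int]:
    """Residue numbers touched by a genomic region, in ascending order."""
    residues = set()
    for pos in range(region[0], region[1] + 1):
        residue = genomic_to_residue(pos, cds_exons)
        if residue is not None:
            residues.add(residue)
    return sorted(residues)


# Hypothetical two-exon coding sequence (6 + 9 bases, i.e. 5 codons).
toy_exons = [(100, 105), (200, 208)]
toy_region_residues = region_to_residues((104, 202), toy_exons)
```

A reverse-strand transcript would need the offsets counted from the 3' genomic end instead, and a production mapping would also have to pick which transcript isoform to use for each gene.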

Subject(s)

Human Genome, Open Reading Frames, Proteins, Humans, Base Sequence, Human Genome/genetics, Genomics, Proteins/genetics, Chromosome Mapping

13.

Novel machine learning approaches revolutionize protein knowledge.

Bordin, Nicola; Dallago, Christian; Heinzinger, Michael; Kim, Stephanie; Littmann, Maria; Rauer, Clemens; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Trends Biochem Sci ; 48(4): 345-359, 2023 04.

Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Subject(s)

Machine Learning, Proteins, Proteins/chemistry, Computational Biology/methods, Protein Conformation

14.

InterPro in 2022.

Paysan-Lafosse, Typhaine; Blum, Matthias; Chuguransky, Sara; Grego, Tiago; Pinto, Beatriz Lázaro; Salazar, Gustavo A; Bileschi, Maxwell L; Bork, Peer; Bridge, Alan; Colwell, Lucy; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex.

Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

Subject(s)

Protein Databases, Humans, Amino Acid Sequence, Artificial Intelligence, Internet, Proteins/chemistry, Software

15.

Dissecting peripheral protein-membrane interfaces.

Tubiana, Thibault; Sillitoe, Ian; Orengo, Christine; Reuter, Nathalie.

PLoS Comput Biol ; 18(12): e1010346, 2022 12.

Article in English | MEDLINE | ID: mdl-36516231

ABSTRACT

Peripheral membrane proteins (PMPs) include a wide variety of proteins that share the ability to bind transiently to the chemically complex interfacial region of membranes through their interfacial binding site (IBS). In contrast to protein-protein or protein-DNA/RNA interfaces, peripheral protein-membrane interfaces are poorly characterized. We collected a dataset of PMP domains representative of the variety of PMP functions: membrane-targeting domains (Annexin, C1, C2, discoidin C2, PH, PX), enzymes (PLA, PLC/D) and lipid-transfer proteins (START). The dataset contains 1328 experimental structures and 1194 AlphaFold models. We mapped the amino acid composition and structural patterns of the IBS of each protein in this dataset, and evaluated which were more likely to be found at the IBS compared to the rest of the domains' accessible surface. In agreement with earlier work we find that about two thirds of the PMPs in the dataset have protruding hydrophobes (Leu, Ile, Phe, Tyr, Trp and Met) at their IBS. The three aromatic amino acids Trp, Tyr and Phe are a hallmark of the IBS of PMPs regardless of whether they protrude on loops or not. This is also the case for lysines but not arginines, suggesting that, unlike for Arg-rich membrane-active peptides, the less membrane-disruptive lysine is preferred in PMPs. Another striking observation was the over-representation of glycines at the IBS of PMPs compared to the rest of their surface, possibly giving IBS loops much-needed flexibility to insert in-between membrane lipids. The analysis of the 9 superfamilies revealed amino acid distribution patterns in agreement with their known functions and membrane-binding mechanisms. Besides revealing novel amino acid patterns at protein-membrane interfaces, our work contributes a new PMP dataset and an analysis pipeline that can be further built upon for future studies of PMP properties, or for developing PMP prediction tools using, for example, machine learning approaches.
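The comparison of amino acid composition at the IBS against the rest of the accessible surface can be illustrated with a short Python sketch. It uses a pseudocount-smoothed log2 odds ratio, which is a stand-in chosen for illustration and not necessarily the statistic used in the study; the residue lists are hypothetical one-letter codes.

```python
from collections import Counter
from math import log2
from typing import Dict, Iterable


def ibs_enrichment(ibs_residues: Iterable[str],
                   other_surface_residues: Iterable[str],
                   pseudocount: float = 1.0) -> Dict[str, float]:
    """log2 ratio of each amino acid's frequency at the IBS vs the rest of the surface.

    Positive values mean over-representation at the IBS. The pseudocount
    keeps the ratio finite when an amino acid is absent from one set.
    """
    ibs_counts = Counter(ibs_residues)
    rest_counts = Counter(other_surface_residues)
    amino_acids = set(ibs_counts) | set(rest_counts)
    ibs_total = sum(ibs_counts.values()) + pseudocount * len(amino_acids)
    rest_total = sum(rest_counts.values()) + pseudocount * len(amino_acids)
    scores = {}
    for aa in amino_acids:
        ibs_freq = (ibs_counts[aa] + pseudocount) / ibs_total
        rest_freq = (rest_counts[aa] + pseudocount) / rest_total
        scores[aa] = log2(ibs_freq / rest_freq)
    return scores


# Hypothetical surface residues of one domain, for illustration only.
toy_scores = ibs_enrichment(ibs_residues="GGGKWL",
                            other_surface_residues="GDEEKSTA")
```

On a real dataset the counts would be pooled over many structures per superfamily, and a significance test would be needed before calling any residue type enriched.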

Subject(s)

Cell Membrane, Peptides, Amino Acids/chemistry, Binding Sites, Peptides/chemistry, Cell Membrane/chemistry

16.

3D-Beacons: decreasing the gap between protein sequences and structures through a federated network of protein structure data resources.

Varadi, Mihaly; Nair, Sreenath; Sillitoe, Ian; Tauriello, Gerardo; Anyango, Stephen; Bienert, Stefan; Borges, Clemente; Deshpande, Mandar; Green, Tim; Hassabis, Demis; Hatos, Andras; Hegedus, Tamas; Hekkelman, Maarten L; Joosten, Robbie; Jumper, John; Laydon, Agata; Molodenskiy, Dmitry; Piovesan, Damiano; Salladini, Edoardo; Salzberg, Steven L; Sommer, Markus J; Steinegger, Martin; Suhajda, Erzsebet; Svergun, Dmitri; Tenorio-Ku, Luiggi; Tosatto, Silvio; Tunyasuvunakool, Kathryn; Waterhouse, Andrew Mark; Zídek, Augustin; Schwede, Torsten; Orengo, Christine; Velankar, Sameer.

Gigascience ; 11, 2022 11 30.

Article in English | MEDLINE | ID: mdl-36448847

ABSTRACT

While scientists can often infer the biological function of proteins from their 3-dimensional quaternary structures, the gap between the number of known protein sequences and their experimentally determined structures keeps increasing. A potential solution to this problem is presented by ever more sophisticated computational protein modeling approaches. While often powerful on their own, most methods have strengths and weaknesses. Therefore, it benefits researchers to examine models from various model providers and perform comparative analysis to identify what models can best address their specific use cases. To make data from a large array of model providers more easily accessible to the broader scientific community, we established 3D-Beacons, a collaborative initiative to create a federated network with unified data access mechanisms. The 3D-Beacons Network allows researchers to collate coordinate files and metadata for experimentally determined and theoretical protein models from state-of-the-art and specialist model providers and also from the Protein Data Bank.

Subject(s)

Metadata, Records, Amino Acid Sequence, Protein Databases, Computer Simulation

17.

Structural and energetic analyses of SARS-CoV-2 N-terminal domain characterise sugar binding pockets and suggest putative impacts of variants on COVID-19 transmission.

Lam, Su Datt; Waman, Vaishali P; Fraternali, Franca; Orengo, Christine; Lees, Jonathan.

Comput Struct Biotechnol J ; 20: 6302-6316, 2022.

Article in English | MEDLINE | ID: mdl-36408455

ABSTRACT

Coronavirus disease 2019 (COVID-19) caused by SARS-CoV-2 is an ongoing pandemic that causes significant health/socioeconomic burden. Variants of concern (VOCs) have emerged affecting transmissibility, disease severity and re-infection risk. Studies suggest that the N-terminal domain (NTD) of the spike protein may have a role in facilitating virus entry via sialic-acid receptor binding. Furthermore, most VOCs include novel NTD variants. Despite global sequence and structure similarity, most sialic-acid binding pockets in NTD vary across coronaviruses. Our work suggests ongoing evolutionary tuning of the sugar-binding pockets and recent analyses have shown that NTD insertions in VOCs tend to lie close to loops. We extended the structural characterisation of these sugar-binding pockets and explored whether variants could enhance sialic acid-binding. We found that recent NTD insertions in VOCs (i.e., Gamma, Delta and Omicron variants) and emerging variants of interest (VOIs) (i.e., Iota, Lambda and Theta variants) frequently lie close to sugar-binding pockets. For some variants, including the recent Omicron VOC, we find increases in predicted sialic acid-binding energy, compared to the original SARS-CoV-2, which may contribute to increased transmission. These binding observations are supported by molecular dynamics (MD) simulations. We examined the similarity of NTD across Betacoronaviruses to determine whether the sugar-binding pockets are sufficiently similar to be exploited in drug design. Whilst most pockets are too structurally variable, we detected a previously unknown highly structurally conserved pocket which can be investigated in pursuit of a generic pan-Betacoronavirus drug. Our structure-based analyses help rationalise the effects of VOCs and provide hypotheses for experiments. Our findings suggest a strong need for experimental monitoring of changes in NTD of VOCs.

18.

A roadmap for the functional annotation of protein families: a community perspective.

de Crécy-Lagard, Valérie; Amorin de Hegedus, Rocio; Arighi, Cecilia; Babor, Jill; Bateman, Alex; Blaby, Ian; Blaby-Haas, Crysten; Bridge, Alan J; Burley, Stephen K; Cleveland, Stacey; Colwell, Lucy J; Conesa, Ana; Dallago, Christian; Danchin, Antoine; de Waard, Anita; Deutschbauer, Adam; Dias, Raquel; Ding, Yousong; Fang, Gang; Friedberg, Iddo; Gerlt, John; Goldford, Joshua; Gorelik, Mark; Gyori, Benjamin M; Henry, Christopher; Hutinet, Geoffrey; Jaroch, Marshall; Karp, Peter D; Kondratova, Liudmyla; Lu, Zhiyong; Marchler-Bauer, Aron; Martin, Maria-Jesus; McWhite, Claire; Moghe, Gaurav D; Monaghan, Paul; Morgat, Anne; Mungall, Christopher J; Natale, Darren A; Nelson, William C; O'Donoghue, Seán; Orengo, Christine; O'Toole, Katherine H; Radivojac, Predrag; Reed, Colbie; Roberts, Richard J; Rodionov, Dmitri; Rodionova, Irina A; Rudolf, Jeffrey D; Saleh, Lana; Sheynkman, Gloria.

Database (Oxford) ; 2022, 2022 08 12.

Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of the biological enterprise and thereby stands in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue, funded by the National Science Foundation, was held on 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues obstructing the field were identified and discussed, which ultimately allowed solutions to be proposed on how to move forward.

Subject(s)

Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation

19.

Profiling the Site of Protein CoAlation and Coenzyme A Stabilization Interactions.

Tossounian, Maria-Armineh; Baczynska, Maria; Dalton, William; Newell, Charlie; Ma, Yilin; Das, Sayoni; Semelak, Jonathan Alexis; Estrin, Dario Ariel; Filonenko, Valeriy; Trujillo, Madia; Peak-Chew, Sew Yeu; Skehel, Mark; Fraternali, Franca; Orengo, Christine; Gout, Ivan.

Antioxidants (Basel) ; 11(7), 2022 Jul 14.

Article in English | MEDLINE | ID: mdl-35883853

ABSTRACT

Coenzyme A (CoA) is a key cellular metabolite known for its diverse functions in metabolism and regulation of gene expression. CoA was recently shown to play an important antioxidant role under various cellular stress conditions by forming a disulfide bond with proteins, termed CoAlation. Using anti-CoA antibodies and liquid chromatography tandem mass spectrometry (LC-MS/MS) methodologies, CoAlated proteins were identified from various organisms, tissues and cell lines under stress conditions. In this study, we integrated currently known CoAlated proteins into mammalian and bacterial datasets (CoAlomes), resulting in a total of 2093 CoAlated proteins (2862 CoAlation sites). Functional classification of these proteins showed that CoAlation is widespread among proteins involved in cellular metabolism, stress response and protein synthesis. Using 35 published CoAlated protein structures, we studied the stabilization interactions of each CoA segment (adenosine diphosphate (ADP) moiety and pantetheine tail) within the microenvironment of the modified cysteines. Alternating polar-non-polar residues, positively charged residues and hydrophobic interactions mainly stabilize the pantetheine tail, phosphate groups and the ADP moiety, respectively. CoA appears flexible in the examined structures, adapting its conformation through interactions with residues surrounding the CoAlation site. Based on these findings, we propose three modes of CoA binding to proteins. Overall, this study summarizes currently available knowledge on CoAlated proteins, their functional distribution and CoA-protein stabilization interactions.

20.

Contrastive learning on protein embeddings enlightens midnight zone.

Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard.

NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better able to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ensured performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
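
The two ideas named in this abstract, contrastive learning on embeddings and embedding-based annotation transfer, can be illustrated with a minimal Python sketch. It is not taken from the ProtTucker code base. The triplet form of the contrastive loss, the Euclidean distance, the function names and the toy two-dimensional vectors with labels `fold_A` and `fold_B` are all assumptions made for illustration; real pLM embeddings have hundreds of dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Assumed contrastive objective: penalise the triplet unless the negative
    is at least `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def transfer_annotation(query, lookup):
    """Embedding-based annotation transfer: return the label of the lookup
    embedding nearest to the query. `lookup` is a list of (label, vector)."""
    label, _ = min(lookup, key=lambda item: euclidean(query, item[1]))
    return label


# Hypothetical toy lookup set; the labels stand in for structural classes.
lookup = [("fold_A", [0.0, 1.0]), ("fold_B", [1.0, 0.0])]
predicted = transfer_annotation([0.1, 0.9], lookup)
```

In this toy setting the query lies close to the `fold_A` vector and therefore inherits that label. Training with a loss of this kind would, under the stated assumptions, pull embeddings of proteins from the same class together so that the nearest-neighbour lookup becomes more reliable.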
